feat(server): New TTL system, enforce max queue length limits, lazy waitpoint creation#2980
Walkthrough (CodeRabbit)
Centralizes queue-size logic (new v3/queueLimits utility and environment queueSizeLimit exposure) and adds an LRU cache for environment queue lengths. Refactors queue validation to per-queue semantics (resolveQueueNamesForBatchItems, validateMultipleQueueLimits) and surfaces itemsSkipped/runCount through the batch streaming APIs. Introduces per-item retry for batch queue processing, batch-run-count updates, and a TriggerFailedTaskService for creating pre-failed runs. Adds a TTL expiration subsystem (batched TTL consumers, Redis TTL scripts, ttlSystem callback) and lazy get-or-create waitpoints with related waitpoint APIs. Numerous RunEngine/RunQueue/BatchQueue public API additions and test updates; UI presenters and routes updated to use the single queueSize quota.
This PR implements a new run TTL system and queue size limits to prevent unbounded queue growth, which should help avoid situations where a queue enters a "death spiral" and can never catch up.
Run TTL system
The main (and correct) way to combat this situation is to enforce a maximum TTL on all runs (e.g. up to 14 days): runs that have been queued for that maximum TTL are auto-expired, making room for newer runs to execute. This required a new TTL system that can handle higher workloads and is now deeply integrated into the RunQueue. When runs are enqueued with a TTL, they are added to their normal queue as well as to the TTL queue. When runs are dequeued, they are removed from both. If a run is instead dequeued by the TTL system, it is also removed from its normal queue. Both removals happen atomically, so there is no race condition.
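In sketch form, the dual-queue bookkeeping could look like the following (a minimal sketch using ioredis; the key names, data structures, and function signatures are illustrative assumptions, not the actual RunQueue internals):

```ts
import Redis from "ioredis";

const redis = new Redis();

// Enqueue: score the run by enqueue time in its normal queue and, when a TTL
// is set, by its absolute expiry timestamp in the shared TTL queue.
async function enqueueRun(queueKey: string, runId: string, ttlMs?: number) {
  const tx = redis.multi().zadd(queueKey, Date.now(), runId);
  if (ttlMs !== undefined) {
    tx.zadd("ttl:queue", Date.now() + ttlMs, runId);
  }
  await tx.exec();
}

// Normal dequeue: remove the run from both structures in one transaction so
// the TTL system can never expire a run that has already been picked up.
async function dequeueRun(queueKey: string, runId: string) {
  await redis.multi().zrem(queueKey, runId).zrem("ttl:queue", runId).exec();
}
```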
The TTL expiration system is also made reliable by expiring runs via a Redis worker; the worker job is enqueued atomically inside the TTL dequeue Lua script.
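A hedged sketch of that atomic hand-off, assuming an ioredis custom command (the script body, key names, and worker list are illustrative; the real Redis TTL scripts differ):

```ts
import Redis from "ioredis";

const redis = new Redis();

// Pop due runs from the TTL queue and enqueue the expiration job for each in
// the same script, so a run can't be lost or expired twice between the steps.
redis.defineCommand("dequeueExpiredRuns", {
  numberOfKeys: 2,
  lua: `
    local dueRuns = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', ARGV[1], 'LIMIT', 0, tonumber(ARGV[2]))
    for _, runId in ipairs(dueRuns) do
      redis.call('ZREM', KEYS[1], runId)
      redis.call('LPUSH', KEYS[2], runId) -- hand off to the expiration worker
    end
    return dueRuns
  `,
});

// Usage: a polling consumer drains due runs in batches; the worker list is
// then consumed by the Redis worker that actually marks runs as expired.
async function pollExpiredRuns() {
  return (redis as any).dequeueExpiredRuns("ttl:queue", "ttl:worker", Date.now(), 100);
}
```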
Optional associated waitpoints
Additionally, this PR implements an optimization: runs that aren't triggered with a dependent parent run no longer create an associated waitpoint. The waitpoint is instead created lazily if a dependent run wants to wait for the child run after the fact (via debounce or idempotency), which is rare but possible. This means fewer waitpoint creations, and also fewer waitpoint completions, for runs with no dependencies.
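A minimal sketch of the get-or-create pattern, with an in-memory map standing in for the database (all names, and the pre-completed handling for already-finished runs, are illustrative assumptions):

```ts
import { randomUUID } from "node:crypto";

type Waitpoint = { id: string; runId: string; status: "PENDING" | "COMPLETED" };

// Stand-in for the database table, keyed by the run the waitpoint completes.
const waitpointsByRun = new Map<string, Waitpoint>();

function getOrCreateWaitpoint(runId: string, runIsFinished: boolean): Waitpoint {
  const existing = waitpointsByRun.get(runId);
  if (existing) return existing;

  // Created lazily, only when a dependent run waits on this run post-facto.
  // If the run already finished, the waitpoint is created pre-completed so
  // the waiter resolves immediately instead of blocking forever (an
  // assumption about how the completed-run edge case is handled).
  const waitpoint: Waitpoint = {
    id: randomUUID(),
    runId,
    status: runIsFinished ? "COMPLETED" : "PENDING",
  };
  waitpointsByRun.set(runId, waitpoint);
  return waitpoint;
}
```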
Environment queue limits
Prevents any single queue from growing too large by enforcing queue size limits at trigger time.
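As a sketch, the trigger-time guard amounts to a length check against the environment's limit (ZCARD as the length probe and the error shape are assumptions; the real code centralizes this in the v3/queueLimits utility with an LRU cache of environment queue lengths):

```ts
import Redis from "ioredis";

const redis = new Redis();

// Reject a trigger when the target queue is already at its size limit.
async function assertQueueHasCapacity(queueKey: string, queueSizeLimit: number) {
  const length = await redis.zcard(queueKey);
  if (length >= queueSizeLimit) {
    throw new Error(
      `Queue ${queueKey} is full (${length}/${queueSizeLimit} runs); rejecting trigger`
    );
  }
}
```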
Batch trigger fixes
Currently, when a batch item cannot be created for whatever reason (e.g. queue limits), the run never gets created, which means a stalled run if you're using batchTriggerAndWait. We've updated the system to handle this differently: when a batch item cannot be triggered and converted into a run, we will eventually (after 8 retries, backing off up to 30s) create a "pre-failed" run with the error details, correctly resolving the batchTriggerAndWait.
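Roughly, the per-item flow looks like this (the 8 attempts and 30s cap come from the description above; the backoff curve and helper signatures are illustrative, and the real implementation retries via the batch queue rather than in-process sleeps as shown here):

```ts
type BatchItem = { index: number; payload: unknown };

const MAX_ATTEMPTS = 8;
const MAX_DELAY_MS = 30_000;

async function processBatchItem(
  item: BatchItem,
  trigger: (item: BatchItem) => Promise<void>,
  createPreFailedRun: (item: BatchItem, error: unknown) => Promise<void>
): Promise<void> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await trigger(item); // convert the batch item into a real run
      return;
    } catch (error) {
      if (attempt === MAX_ATTEMPTS) {
        // Out of retries: persist a "pre-failed" run carrying the error so a
        // waiting batchTriggerAndWait resolves instead of stalling forever.
        await createPreFailedRun(item, error);
        return;
      }
      // Exponential backoff capped at 30 seconds (illustrative schedule).
      const delayMs = Math.min(2 ** attempt * 500, MAX_DELAY_MS);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```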